In [None]:
# Install required packages
!pip install --upgrade --quiet natural-pdf[ocr-export,ai]
!pip install --upgrade --quiet easyocr
!pip install --upgrade --quiet surya-ocr

print('✓ Packages installed!')

**Slides:** [slides.pdf](./slides.pdf)

# OCR: Recognizing text

Sometimes you can't actually get the text off of the page, because it's an *image* of text instead of actual text.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.show()

Looks the same as the last one, right? But when we try to extract the text...

In [None]:
text = page.extract_text()
print(text)

Nothing! **It's time for OCR.**

There are a looooot of OCR engines out there, and one of the things that makes Natural PDF nice is that it supports several of them. Figuring out which one is the "best" isn't as tough when you can just run them all right after each other.

The default is [EasyOCR](https://github.com/JaidedAI/EasyOCR) which usually works fine. But what happens when we try it with this document?

In [None]:
page.apply_ocr()

It does pretty well! The only issue is it gives me **Durham's Pure Leaf Lardl** instead of **Durham's Pure Leaf Lard!** I don't really need to know why, though, because I can just try some other engine! You can also fool around with the options: some of the lowest-hanging fruit is increasing the resolution of the OCR. The default at the moment is 150; you can try upping it to 300 for (potentially) better results.

To fix this I'll both up the resolution and try another OCR engine, [surya](https://github.com/datalab-to/surya).
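
To get a feel for why resolution matters, it helps to look at how many pixels the OCR engine actually gets to work with. This is a rough sketch, assuming `resolution` means dots per inch and the page is US Letter size (612 x 792 PDF points, at 72 points per inch). `rendered_size` is a hypothetical helper for illustration, not part of Natural PDF.

```python
# PDF pages are measured in points, and there are 72 points per inch
POINTS_PER_INCH = 72

def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given DPI (hypothetical helper)."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px

# Assumption: a US Letter page, 8.5 x 11 inches = 612 x 792 points
for dpi in (150, 300):
    width, height = rendered_size(612, 792, dpi)
    print(f"{dpi} dpi -> {width} x {height} pixels")
```

Doubling the resolution quadruples the number of pixels, so a higher setting will probably be slower as well as (potentially) more accurate.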

## Correcting OCR

While we love OCR when it works, it often does *not* work great. We have a few solutions: send humans after it, or use LLMs or spell check to correct it.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")

page = pdf.pages[0]
# We'll OCR this with the worst possible resolution
page.apply_ocr('surya', resolution=15)

After we apply OCR we can export to a magic format that we can display and fix up separately!

In [None]:
text = page.extract_text()
print(text)

### With LLMs

Let's see what our text looks like.

In [None]:
page.find_all('text').inspect()

Some of these are pretty easy - for example, "Uraanitary Warking Conditions" should be "Unsanitary Working Conditions." OCR tools just don't know that kind of thing! But what if we could go through each piece of text and run some sort of spell check or something?

You can use `correct_ocr` to change the text in a region.

In [None]:
def correct_text_region(region):
    return "This is the updated text"
    
page.correct_ocr(correct_text_region) 

And then, magically, all of our text is whatever we `return`.

In [None]:
page.find_all('text').inspect()
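
One way to think about a callback-based correction like this: loop over every text element, swap in whatever the callback returns, and leave the element's box alone. The toy model below is only a mental picture, not Natural PDF's implementation. `Element`, `apply_correction`, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """Hypothetical stand-in for an OCR text element: a box plus its text."""
    bbox: tuple  # (x0, top, x1, bottom), made-up coordinates
    text: str

def apply_correction(elements, callback):
    """Toy model: replace each element's text with whatever the callback returns."""
    for element in elements:
        new_text = callback(element)
        # Design choice in this toy model: None means "leave this one alone"
        if new_text is not None:
            element.text = new_text
    return elements

elements = [
    Element(bbox=(50, 100, 300, 120), text="Uraanitary Warking Conditions"),
    Element(bbox=(50, 130, 200, 150), text="Violations"),
]

# Any function of the element works as a callback; here we just uppercase the text
apply_correction(elements, lambda element: element.text.upper())

for element in elements:
    print(element.bbox, element.text)
```

The text changes, but every box stays exactly where it was.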

But clearly we don't want the same thing every time! Let's add the bad OCR back in...

In [None]:
# Re-apply the OCR to break it again
page.apply_ocr('surya', resolution=20)

...and feed each line to an LLM trying to fix it.

In [None]:
from openai import OpenAI

client = OpenAI(api_key='sk-proj--......')

prompt = """
Correct the spelling of this OCR'd text, a snippet of a document.
Preserve original capitalization, punctuation, and symbols. 
Changing meaning is okay if it's clearly an OCR issue.
Do not add any explanatory text, translations, comments, or quotation marks around the result.
"""

def correct_text_region(region):
    text = region.extract_text()
    completion = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {
                "role": "system", "content": prompt
            },
            {
                "role": "user",
                "content": text
            },
        ],
    )

    updated = completion.choices[0].message.content

    if text != updated:
        print(f"OLD: {text}\nNEW: {updated}")

    return updated

page.correct_ocr(correct_text_region) 

And now we can use `.extract_text()` the same magical way.

The real benefit of this vs sending the whole document to the LLM is *we don't change where the text is*. An LLM might OCR something for us, but it *loses the spatial context that we find so important*.

In [None]:
text = page.extract_text()
print(text)
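
If you'd rather not call an LLM at all, the spell-check idea can be sketched with Python's built-in `difflib`. This is a minimal sketch under some big assumptions: `VOCABULARY` is a tiny hypothetical word list (you'd want a real one that fits your documents), and `FakeRegion` is only there so the example runs without a PDF.

```python
import difflib

# Hypothetical vocabulary: in practice, use a word list that fits your documents
VOCABULARY = ["unsanitary", "working", "conditions", "violations", "inspection"]

def fix_word(word, cutoff=0.6):
    """Return the closest vocabulary word, or the original if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    if not matches:
        return word
    fixed = matches[0]
    # Keep a leading capital letter if the original word had one
    return fixed.capitalize() if word[:1].isupper() else fixed

def correct_text_region(region):
    """Same shape as the callbacks above: take a region, return corrected text."""
    text = region.extract_text()
    return " ".join(fix_word(word) for word in text.split())

class FakeRegion:
    """Stand-in so the sketch runs without a PDF; real regions come from the page."""
    def __init__(self, text):
        self.text = text

    def extract_text(self):
        return self.text

print(correct_text_region(FakeRegion("Uraanitary Warking Conditions")))
```

This is much cruder than the LLM version: it only knows the words you give it, it can swallow punctuation that's attached to a word, and it collapses extra whitespace. It's also free and runs offline.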

## Let's do the OCR with the LLM, period

But if the LLM is *that good* at OCR, we can also find pieces of the page we would like to OCR and *send them each in isolation to the LLM*. We use `detect_only=True` so it doesn't try to figure out what the text is, just that the text is there.

In [None]:
page.apply_ocr('surya', detect_only=True)
page.find_all('text').show()

In [None]:
page.find_all('text').inspect()

Now we'll do an even fancier `correct_text_region`: it takes the region as an image, and sends it right on over to the LLM for OCR.

In [None]:
from openai import OpenAI
from natural_pdf.ocr.utils import direct_ocr_llm

client = OpenAI(api_key='API_KEY_GOES_HERE')

prompt = """OCR this image patch. Return only the exact text content visible in the image. 
Preserve original spelling, capitalization, punctuation, and symbols.
Fix misspellings if they are the result of blurry or incorrect OCR.
Do not add any explanatory text, translations, comments, or quotation marks around the result.
If you cannot process the image or do not see any text, return an empty space.
The text is from an inspection report of a slaughterhouse."""

def correct_text_region(region):
    # Use a high resolution for the LLM call for best accuracy
    return direct_ocr_llm(
        region, 
        client, 
        prompt=prompt, 
        resolution=150, 
        model="gpt-4o" 
    )

page.correct_ocr(correct_text_region) 
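
If you're curious what "sending an image to the LLM" generally involves: the image gets base64-encoded and tucked into the chat messages as a data URL. The sketch below shows that general pattern for the OpenAI chat format. It is not a copy of what `direct_ocr_llm` does internally, and `build_ocr_messages` is a hypothetical helper.

```python
import base64

def build_ocr_messages(image_bytes, prompt):
    """Build chat messages pairing a text prompt with a PNG image (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"role": "system", "content": prompt},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encoded}"},
                },
            ],
        },
    ]

# Placeholder bytes: in practice these would be the PNG of one cropped text region
fake_png = b"\x89PNG\r\n\x1a\n"
messages = build_ocr_messages(fake_png, "OCR this image patch.")

# Peek at the start of the data URL we built
print(messages[1]["content"][0]["image_url"]["url"][:30])
```

A list like `messages` is what you would hand to `client.chat.completions.create(...)` along with a vision-capable model name.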

What do we have now?

In [None]:
page.find_all('text').inspect()

In [None]:
text = page.extract_text()
print(text)

## Finding tables on OCR documents

When we used `page.extract_table()` last time, it was easy because there were all of these `line` elements on the page that pdfplumber could detect and say "hey, it's a table!" For the same reason that there's no *real* text on the page, there are also no *real* lines on the page. Instead, we're going to do a fun secret trick where we look at which horizontal and vertical coordinates *seem* like they might be lines by setting a threshold.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
# Grab the page from the freshly loaded PDF before running OCR on it
page = pdf.pages[0]
page.apply_ocr()
page.show(width=900)

In [None]:
table_area = (
    page
    .find('text:contains(Violations)')
    .below(
        until='text:contains(Jungle)',
        include_endpoint=False
    )
)
table_area.show(crop=True)

In [None]:
from natural_pdf.analyzers.guides import Guides

guides = Guides(table_area)
guides.vertical.from_lines(threshold=0.4)
guides.horizontal.from_lines()
guides.show()
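
To build some intuition for what a threshold like `0.4` could mean, here's a toy version of the idea in plain Python. I'm assuming the threshold is roughly "what fraction of a column has to be dark before we call it a line"; check the Natural PDF docs for the exact definition. `find_vertical_lines` and the tiny image are made up for illustration.

```python
def find_vertical_lines(pixels, threshold=0.4):
    """Return column indexes where the share of dark pixels is at least `threshold`.

    `pixels` is a list of rows; each row is a list of 0 (light) or 1 (dark).
    """
    height = len(pixels)
    width = len(pixels[0])
    columns = []
    for x in range(width):
        dark = sum(row[x] for row in pixels)
        if dark / height >= threshold:
            columns.append(x)
    return columns

# A tiny made-up 5x6 "image": column 1 is a solid line, column 4 is a broken one
image = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
]

print(find_vertical_lines(image, threshold=0.4))  # [1, 4]
print(find_vertical_lines(image, threshold=0.8))  # [1]
```

A lower threshold picks up faint or broken lines (and more noise); a higher one only keeps the solid ones.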

Now we can add the lines and use them to detect the table.

In [None]:
df = guides.extract_table().to_df()
df.head()

### Figuring out information about things that are *not* text

As a tiny preview of the next notebook: **what about those checkboxes?** Turns out we can use **image classification AI** to handle them for us!